Skip to content

ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary#1431

Closed
zratkai wants to merge 2 commits into
apache:mainfrom
zratkai:ORC-1384
Closed

ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary#1431
zratkai wants to merge 2 commits into
apache:mainfrom
zratkai:ORC-1384

Conversation

@zratkai

@zratkai zratkai commented Mar 7, 2023

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Avoid ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary. Check the size of the dictionary and input and read only the min of those.

Why are the changes needed?

In Hive when reading with LLAP data is read in 4kB blocks which leads to ArrayIndexOutOfBoundsException when the dictionary is smaller.

How was this patch tested?

It is tested with HIVE's qtest, since here we do not have the necessary subclasses.

@github-actions github-actions Bot added the JAVA label Mar 7, 2023
@guiyanakuang guiyanakuang changed the title ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary s… ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary Mar 8, 2023
Comment thread java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
@deshanxiao

deshanxiao commented Mar 8, 2023

Copy link
Copy Markdown
Contributor

Thank you for making the PR.
Could you make the CI happy with mvn checkstyle:check -Panalyze -Pbenchmark ? @zratkai

[INFO] There is 1 error reported by Checkstyle 9.3 with checkstyle.xml ruleset.
Error:  src/java/org/apache/orc/impl/TreeReaderFactory.java:[2295] (sizes) LineLength: Line is longer than 100 characters (found 114).

Comment thread java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
Comment thread java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
@deshanxiao

Copy link
Copy Markdown
Contributor

Actually, the issue blocked the upgrade of ORC in Hive:
apache/hive#3833

@zratkai

zratkai commented Mar 8, 2023

Copy link
Copy Markdown
Contributor Author

@deshanxiao thanks for adding this info. Yes, exactly, this was the original issue I started with. To finish it this fix is needed.

…tream bigger then dictionary

### What changes were proposed in this pull request?
Avoid  ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary. Check the size of the dictionary and input and read only the min of those.

### Why are the changes needed?
In Hive when reading with LLAP data is read in 4kB blocks which leads to ArrayIndexOutOfBoundsException when the dictionary is smaller.

### How was this patch tested?
It is tested with HIVE's qtest, since here we do not have the necessary subclasses.

This closes apache#1384
@dongjoon-hyun dongjoon-hyun changed the title ORC-1384 Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary Mar 8, 2023

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @zratkai .

  • Do you happen to know what caused this regression?
  • Do you think you can add a small unit test which you described in the commets?

@dongjoon-hyun

Copy link
Copy Markdown
Member

Also cc @williamhyun

@dongjoon-hyun dongjoon-hyun added this to the 1.8.3 milestone Mar 8, 2023
@zratkai

zratkai commented Mar 9, 2023

Copy link
Copy Markdown
Contributor Author

@dongjoon-hyun

Thank you for making a PR, @zratkai .

  • Do you happen to know what caused this regression?
  • Do you think you can add a small unit test which you described in the commets?
  1. It is caused by LLAP in Hive, since it reads in 4kB blocks as I described above.
  2. I was thinking about that, but it is not easy for this modification, since it is in a private method, and we don't have any test even for the public method which calls this (org.apache.orc.impl.TreeReaderFactory.StringDictionaryTreeReader#nextVector) . So it would mean I need to create for that as well from scratch.

@dongjoon-hyun dongjoon-hyun changed the title ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary Mar 9, 2023

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.
Merged to main/1.8 for Apache ORC 1.8.3

dongjoon-hyun added a commit that referenced this pull request Mar 9, 2023
…y stream bigger then dictionary

### What changes were proposed in this pull request?
Avoid  ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary. Check the size of the dictionary and input and read only the min of those.

### Why are the changes needed?
In Hive when reading with LLAP data is read in 4kB blocks which leads to ArrayIndexOutOfBoundsException when the dictionary is smaller.

### How was this patch tested?
It is tested with HIVE's qtest, since here we do not have the necessary subclasses.

Closes #1431 from zratkai/ORC-1384.

Lead-authored-by: Zoltan Ratkai <zratkai@cloudera.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8cf9057)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you, @zratkai and @guiyanakuang .

Also, cc @williamhyun

@dongjoon-hyun

Copy link
Copy Markdown
Member

BTW, welcome to the Apache ORC community, @zratkai .
ORC-1384 is resolved and assigned to you.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…y stream bigger then dictionary

### What changes were proposed in this pull request?
Avoid  ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary. Check the size of the dictionary and input and read only the min of those.

### Why are the changes needed?
In Hive when reading with LLAP data is read in 4kB blocks which leads to ArrayIndexOutOfBoundsException when the dictionary is smaller.

### How was this patch tested?
It is tested with HIVE's qtest, since here we do not have the necessary subclasses.

Closes apache#1431 from zratkai/ORC-1384.

Lead-authored-by: Zoltan Ratkai <zratkai@cloudera.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants